This is a first pass exploration of different aspects of beer.
- The main question on the table is this:
- Are beer styles actually indicative of shared attributes of the beers within that style? Or are style boundaries more or less arbitrary?
- Two approaches: clustering and prediction
- Clustering: are there natural clusters across the spectum of beers that align well with the styles they’re grouped into?
- Unsupervised (k-means) clustering based on
- ABV (alcohol by volume), IBU (international bitterness units), SRM (measure of color)
- How well do these match up with various “style centers,” defined by mean of ABV, IBU, and SRM per beer style
- Prediction: can we predict a beer’s style based on certain characteristics of the beer?
- Answer thus far
- Beer-intrinsic attributes aren’t great predictors of style
- Style-defined attributes are the best predictors
- For instance, the glass a beer is served in (which is defined by its style) is a much better predictor of its style than actual characteristics of the beer like ABV and even the number of different types of hops it contains
Workflow Overview
- Hit the BreweryDB API to iteratively pull in all beers and their ingredients along with other things we might want like breweries and glassware
- Unnest the JSON responses, including all the ingredients columns, and
Dump this all into a MySQL database
- Create a
style_collapsed column to reduce the number of levels of our outcome variable
grep through each beer’s style to determine if that style contains a keyword that qualifies it to be rolled into a collapsed style
- If it does, it gets that keyword in a
style_collapsed column
- Further collpase styles that are similar like Hefeweizen and Wit into Wheat
- Unnest the ingredients
hops and malts into a sparse matrix
- Individual ingredients as columns, beers as rows; cell gets a 1 if ingredient is present and 0 otherwise
Cluster: unsupervised k-means clsutering based on ABV, IBU, and SRM
- Run a neural net
- Predict either
style or style_collapsed from all the predictors including the total number of hops and malts per beer
Short Aside
The question of what should be a predictor variable for style is a bit murky here. What should be fair game for predicting style and what shouldn’t? Characteristics of a beer that are defined by its style would seem to be “cheating” in a way.
- Main candidates are:
- ABV (alcohol by volume), IBU (international bitterness units), SRM (standard reference measure, a scale of beer color from light to dark)
- These are outputs of a beer that meaningfully define the beer and are theoretically orthogonal to each other
- Ingredients in a beer such as hops and malts
- Inputs to a beer that have some effect on its flavor profile
- Semi-cheating because if style is determined beforehand it likely determines at least in part which ingredients are added
- Glass type
- This is defined entirely by style and is very predictive of it
This document compiled by querying the beer database I built, specifically by sourcing the file read_from_db.R. This is done for expediency’s sake, (the code below detailing how to get the beer, run in full in run_it.R, takes some time to execute).
Get and Prepare Data
Getting beer, the age-old dilemma
- The BreweryDB API returns a certain number of results per page; if we want
- So, we hit the BreweryDB API and ask for
1:number_of_pages
- We can change
number_of_pages to, e.g., 3 if we only want the first 3 pages
- If there’s only one page (as is the case for the glassware endpoing), numberOfPages won’t be returned, so in this case we set number_of_pages to 1
- The
addition parameter can be an empty string if nothing else is needed
base_url <- "http://api.brewerydb.com/v2"
key_preface <- "/?key="
paginated_request <- function(ep, addition) {
full_request <- NULL
first_page <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
, "&p=1"))
number_of_pages <- ifelse(!(is.null(first_page$numberOfPages)),
first_page$numberOfPages, 1)
for (page in 1:number_of_pages) {
this_request <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
, "&p=", page, addition),
flatten = TRUE)
this_req_unnested <- unnest_it(this_request) # <- request unnested here
print(this_req_unnested$currentPage)
full_request <- bind_rows(full_request, this_req_unnested[["data"]])
}
full_request
}
all_beer_raw <- paginated_request("beers", "&withIngredients=Y")
- Function for unnesting JSON used inside
paginated_request() below
- Takes the column named
name nested within a column in the data portion of the response
- If the
name column doesn’t exist, it takes the first nested column
- We use something similar to unnest ingredient like all of a beer’s hops and malts into a long string contained in
hops_name and malt_name
unnest_it <- function(df) {
unnested <- df
for(col in seq_along(df[["data"]])) {
if(! is.null(ncol(df[["data"]][[col]]))) {
if(! is.null(df[["data"]][[col]][["name"]])) {
unnested[["data"]][[col]] <- df[["data"]][[col]][["name"]]
} else {
unnested[["data"]][[col]] <- df[["data"]][[col]][[1]]
}
}
}
unnested
}
Collapse Styles
- Save the most popular styles in
keywords
- Loop through each keyword
- For each beer,
grep through its style column to see if it contains any one of these keywords
- If it does, give it that keyword in a new column
style_collapsed
- If a beer’s name matches multiple keywords, e.g., American Double India Pale Ale would match Double India Pale Ale, India Pale Ale, and Pale Ale, its
style_collapsed is the last of those that appear in keywords
- This is why keywords are intentionally ordered from most general to most specific
- So in the case of an case of American Double India Pale Ale: since Double India Pale Ale appears in
keywords after India Pale Ale and Pale Ale, an American Double India Pale Ale would get a style_collapsed of Double India Pale Ale
- If no keyword is contained in
style, style_collapsed is just whatever’s in style; in other words, it doesn’t get collpsed into a bigger bucket
- This isn’t a huge problem because we’ll pare down to just the most popular styles later, however we could think about creating a catchall “Other” level for
style_collapsed
collapse_styles <- function(df) {
keywords <- c("Lager", "Pale Ale", "India Pale Ale", "Double India Pale Ale", "India Pale Lager", "Hefeweizen", "Barrel-Aged","Wheat", "Pilsner", "Pilsener", "Amber", "Golden", "Blonde", "Brown", "Black", "Stout", "Porter", "Red", "Sour", "Kölsch", "Tripel", "Bitter", "Saison", "Strong Ale", "Barley Wine", "Dubbel", "Altbier")
for (beer in 1:nrow(df)) {
if (grepl(paste(keywords, collapse="|"), df$style[beer])) {
for (keyword in keywords) {
if(grepl(keyword, df$style[beer]) == TRUE) {
df$style_collapsed[beer] <- keyword
}
}
} else {
df$style_collapsed[beer] <- as.character(df$style[beer])
}
print(df$style_collapsed[beer])
}
return(df)
}
- Then we collapse further; right now we just combine all wheaty bears into Wheat and Pils-like beers into Pilsener (with two e’s) by
fct_collapseing those levels
collapse_further <- function(df) {
df[["style_collapsed"]] <- df[["style_collapsed"]] %>%
fct_collapse(
"Wheat" = c("Hefeweizen", "Wheat"),
"Pilsener" = c("Pilsner", "American-Style Pilsener") # pilsener == pilsner == pils
)
return(df)
}
Split out Ingredients
- When we unnested ingredients, we just concatenated all of the ingredients for a given beer into a long string
- If we want, we can split out the ingredients that were concatenated in
<ingredient>_name with this split_ingredients function
This takes a vector of ingredients_to_split, so e.g. c("hops_name", "malt_name") and creates one column for each type of ingredient (hops_name_1, hops_name_2, etc.)
- We
str_split on the ingredient and get a list back
- We find the max number of instances of an ingredient per beer, which will be the number of columns we’re adding
- For each new column we need, we create it, initialize it with NAs, and name it
Then for each element in our list of split up ingredients, if it exists, we add it to the correct column in our df
split_ingredients <- function(df, ingredients_to_split) {
ncol_df <- ncol(df)
for (ingredient in ingredients_to_split) {
ingredient_split <- str_split(df[[ingredient]], ", ")
num_new_cols <- max(lengths(ingredient_split))
for (num in 1:num_new_cols) {
this_col <- ncol_df + 1
df[, this_col] <- NA
names(df)[this_col] <- paste0(ingredient, "_", num)
ncol_df <- ncol(df)
for (row in seq_along(ingredient_split)) {
if (!is.null(ingredient_split[[row]][num])) {
df[row, this_col] <- ingredient_split[[row]][num]
}
}
df[[names(df)[this_col]]] <- factor(df[[names(df)[this_col]]])
}
ncol_df <- ncol(df)
}
return(df)
}
Find the Most Popualar Styles
- Find mean ABV, IBU, and SRM per collapsed style
- Arrange collapsed styles by the number of beers that fall into them
- This is of course dependent on how we collapse styles
library(forcats)
# Pare down to only cases where style is not NA
beer_dat <- beer_necessities
beer_dat_pared <- beer_dat[complete.cases(beer_dat$style), ]
# Arrange beer dat by style popularity
style_popularity <- beer_dat_pared %>%
group_by(style) %>%
count() %>%
arrange(desc(n))
style_popularity
# Add a column that scales popularity
style_popularity <- bind_cols(style_popularity,
n_scaled = as.vector(scale(style_popularity$n)))
# Find styles that are above a z-score of 0
popular_styles <- style_popularity %>%
filter(n_scaled > 0)
# Pare dat down to only beers that fall into those styles
popular_beer_dat <- beer_dat_pared %>%
filter(
style %in% popular_styles$style
) %>%
droplevels() %>%
as_tibble()
nrow(popular_beer_dat)
# Find the centers (mean abv, ibu, srm) of the most popular styles
style_centers <- popular_beer_dat %>%
group_by(style_collapsed) %>%
add_count() %>%
summarise(
mean_abv = mean(abv, na.rm = TRUE),
mean_ibu = mean(ibu, na.rm = TRUE),
mean_srm = mean(srm, na.rm = TRUE),
n = median(n, na.rm = TRUE) # Median here only for summarise. Should be just the same as n
) %>%
arrange(desc(n)) %>%
drop_na() %>%
droplevels()
# Give some nicer names
style_centers_rename <- style_centers %>%
rename(
`Collapsed Style` = style_collapsed,
`Mean ABV` = mean_abv,
`Mean IBU` = mean_ibu,
`Mean SRM` = mean_srm,
`Numer of Beers` = n
)
Take a look at the table
kable(style_centers_rename)
| India Pale Ale |
6.578468 |
66.04268 |
9.989313 |
6524 |
| Pale Ale |
5.695480 |
40.86930 |
8.890306 |
4280 |
| Stout |
7.991841 |
43.89729 |
36.300000 |
4238 |
| Wheat |
5.158040 |
17.47168 |
5.861842 |
3349 |
| Double India Pale Ale |
8.930599 |
93.48142 |
11.006873 |
2525 |
| Red |
5.742565 |
33.81127 |
16.178862 |
2521 |
| Lager |
5.453718 |
30.64361 |
8.457447 |
2230 |
| Saison |
6.400189 |
27.25114 |
7.053476 |
2167 |
| Blonde |
5.595298 |
22.39432 |
5.625000 |
2044 |
| Porter |
6.182049 |
33.25369 |
32.197605 |
1973 |
| Brown |
6.159212 |
32.21577 |
23.592000 |
1462 |
| Pilsener |
5.227593 |
33.51346 |
4.413462 |
1268 |
| Specialty Beer |
6.446402 |
33.77676 |
15.520548 |
1044 |
| Bitter |
5.322364 |
38.28175 |
12.460526 |
939 |
| Fruit Beer |
5.195222 |
19.24049 |
8.666667 |
905 |
| Herb and Spice Beer |
6.621446 |
27.77342 |
18.166667 |
872 |
| Sour |
6.224316 |
18.88869 |
10.040816 |
797 |
| Strong Ale |
8.826425 |
36.74233 |
22.547945 |
767 |
| Tripel |
9.029775 |
32.51500 |
7.680556 |
734 |
| Black |
6.958714 |
65.50831 |
31.080000 |
622 |
| Barley Wine |
10.781600 |
74.04843 |
19.561404 |
605 |
| Kölsch |
4.982216 |
23.37183 |
4.371795 |
593 |
| Barrel-Aged |
9.002506 |
39.15789 |
18.133333 |
540 |
| Other Belgian-Style Ales |
7.516318 |
37.55812 |
17.549020 |
506 |
| Pumpkin Beer |
6.712839 |
23.48359 |
17.918033 |
458 |
| Dubbel |
7.509088 |
25.05128 |
22.940000 |
399 |
| Scotch Ale |
7.620233 |
26.36909 |
24.222222 |
393 |
| German-Style Doppelbock |
8.045762 |
28.88692 |
25.696970 |
376 |
| Fruit Cider |
6.205786 |
25.60000 |
12.000000 |
370 |
| German-Style Märzen |
5.746102 |
25.63796 |
14.322581 |
370 |
Now that the munging is done, onto the main question: do natural clusters in beer align with style boundaries?
Ingredients
To get more granular with ingredients, we can split out each individual ingredient into its own column. If a beer or style contains that ingredient, its row gets a 1 in that ingredient column and a 0 otherwise.
From this, we can find the total number of hops and malts per grouper.
- The dataframe we’ll use will be
beer_necessities
- This function takes a dataframe and two other parameters set at the outset:
ingredient_want: this can be hops, malt, or other ingredients like yeast if we pull that in
grouper: can be a vector of one or more things to group by, like beer name or style
pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
# ----------------------- Setup --------------------------- #
# We've already split ingredient number names out from the concatenated string into columns like `malt_name_1`,
# `malt_name_2`, etc. We need to find the range of these columns; there will be a different number of malt
# columns than hops columns, for instance. The first one will be `<ingredient>_name_1` and from this we can find
# the index of this column in our dataframe. We get the name of last one with the `get_last_ing_name_col()`
# function. Then we save a vector of all the ingredient column names in `ingredient_colnames`. It will stay
# constant even if the indices change when we select out certain columns.
# First ingredient
first_ingredient_name <- paste(ingredient_want, "_name_1", sep="")
first_ingredient_index <- which(colnames(df)==first_ingredient_name)
# Get the last ingredient
get_last_ing_name_col <- function(df) {
for (col in names(df)) {
if (grepl(paste(ingredient_want, "_name_", sep = ""), col) == TRUE) {
name_last_ing_col <- col
}
}
return(name_last_ing_col)
}
# Last ingredient
last_ingredient_name <- get_last_ing_name_col(df)
last_ingredient_index <- which(colnames(df)==last_ingredient_name)
# Vector of all the ingredient column names
ingredient_colnames <- names(df)[first_ingredient_index:last_ingredient_index]
# Non-ingredient column names we want to keep
to_keep_col_names <- c("cluster_assignment", "name", "abv", "ibu", "srm", "style", "style_collapsed")
# -------------------------------------------------------------------------------#
# Inside `gather_ingredients()` we take out superflous column names that are not in `to_keep_col_names` or one
# of the ingredient columns, find what the new ingredient column indices are, since they'll have changed after
# we pared down and then gather all of the ingredient columns (e.g., `hops_name_1`) into one long column,
# `ing_keys` and all the actual ingredient names (e.g., Cascade) into `ing_names`.
# ----------------------------- Gather columns --------------------------------- #
gather_ingredients <- function(df, cols_to_gather) {
to_keep_indices <- which(colnames(df) %in% to_keep_col_names)
selected_df <- df[, c(to_keep_indices, first_ingredient_index:last_ingredient_index)]
new_ing_indices <- which(colnames(selected_df) %in% cols_to_gather) # indices will have changed since we pared down
df_gathered <- selected_df %>%
gather_(
key_col = "ing_keys",
value_col = "ing_names",
gather_cols = colnames(selected_df)[new_ing_indices]
) %>%
mutate(
count = 1
)
df_gathered
}
beer_gathered <- gather_ingredients(df, ingredient_colnames) # ingredient colnames defined above function
# ------------------------------------------------------------------------------- #
# Next we get a vector of all ingredient levels and take out the one that's an empty string and
# use this vector of ingredient levels in `select_spread_cols()` below
# Get a vector of all ingredient levels
beer_gathered$ing_names <- factor(beer_gathered$ing_names)
ingredient_levels <- levels(beer_gathered$ing_names)
# Take out the level that's just an empty string
to_keep_levels <- !(c(1:length(ingredient_levels)) %in% which(ingredient_levels == ""))
ingredient_levels <- ingredient_levels[to_keep_levels]
beer_gathered$ing_names <- as.character(beer_gathered$ing_names)
# ----------------------------------------------------------------------------- #
# Then we spread the ingredient names: we take what was previously the `value` in our gathered dataframe, the
# actual ingredient names (Cascade, Centennial) and make that our `key`; it'll form the new column names. The
# new `value` is `value` is count; it'll populate the row cells. If a given row has a certain ingredient, it
# gets a 1 in the corresponding cell, an NA otherwise.
# We add a unique idenfitier for each row with `row`, which we'll drop later (see [Hadley's SO
# comment](https://stackoverflow.com/questions/25960394/unexpected-behavior-with-tidyr)).
# ------------------------------- Spread columns -------------------------------- #
spread_ingredients <- function(df) {
df_spread <- df %>%
mutate(
row = 1:nrow(df) # Add a unique idenfitier for each row which we'll need in order to spread; we'll drop this later
) %>%
spread(
key = ing_names,
value = count
)
return(df_spread)
}
beer_spread <- spread_ingredients(beer_gathered)
# ------------------------------------------------------------------------------- #
# ------------------------- Select only certain columns ------------------------- #
select_spread_cols <- function(df) {
to_keep_col_indices <- which(colnames(df) %in% to_keep_col_names)
to_keep_ingredient_indices <- which(colnames(df) %in% ingredient_levels)
to_keep_inds_all <- c(to_keep_col_indices, to_keep_ingredient_indices)
new_df <- df %>%
select_(
.dots = to_keep_inds_all
)
return(new_df)
}
beer_spread_selected <- select_spread_cols(beer_spread)
# ------------------------------------------------------------------------------- #
# Take out all rows that have no ingredients specified at all
inds_to_remove <- apply(beer_spread_selected[, first_ingredient_index:last_ingredient_index],
1, function(x) all(is.na(x)))
beer_spread_no_na <- beer_spread_selected[ !inds_to_remove, ]
# ----------------- Group ingredients by the grouper specified ------------------- #
# Then we do the final step and group by the groupers.
get_ingredients_per_grouper <- function(df, grouper = grouper) {
df_grouped <- df %>%
ungroup() %>%
group_by_(grouper)
not_for_summing <- which(colnames(df_grouped) %in% to_keep_col_names)
max_not_for_summing <- max(not_for_summing)
per_grouper <- df_grouped %>%
select(-c(abv, ibu, srm)) %>% # taking out temporarily
summarise_if(
is.numeric,
sum, na.rm = TRUE
# -c(abv, ibu, srm)
) %>%
mutate(
total = rowSums(.[(max_not_for_summing + 1):ncol(.)], na.rm = TRUE)
)
# Send total to the second position
per_grouper <- per_grouper %>%
select(
name, total, everything()
)
# Replace total column with more descriptive name: total_<ingredient>
names(per_grouper)[which(names(per_grouper) == "total")] <- paste0("total_", ingredient_want)
return(per_grouper)
}
# ------------------------------------------------------------------------------- #
ingredients_per_grouper <- get_ingredients_per_grouper(beer_spread_selected, grouper)
return(ingredients_per_grouper)
}
- Now run the function with
ingredient_want as first hops, then malt
- Then join the resulting dataframes and remove/reorder some columns
# Run the entire function with ingredient_want set to hops, grouping by name
ingredients_per_beer_hops <- pick_ingredient_get_beer(ingredient_want = "hops",
beer_necessities,
grouper = c("name", "style_collapsed"))
# Same for malt
ingredients_per_beer_malt <- pick_ingredient_get_beer(ingredient_want = "malt",
beer_necessities,
grouper = c("name", "style_collapsed"))
# Join those on our original dataframe by name
beer_ingredients_join_first_ingredient <- left_join(beer_necessities, ingredients_per_beer_hops,
by = "name")
beer_ingredients_join <- left_join(beer_ingredients_join_first_ingredient, ingredients_per_beer_malt,
by = "name")
# Take out some unnecessary columns
unnecessary_cols <- c("styleId", "abv_scaled", "ibu_scaled", "srm_scaled",
"hops_id", "malt_id", "glasswareId", "style.categoryId")
beer_ingredients_join <- beer_ingredients_join[, (! names(beer_ingredients_join) %in% unnecessary_cols)]
# If we also want to take out any of the malt_name_1, malt_name_2, etc. columns we can do this with a grep
more_unnecessary <- c("hops_name_|malt_name_")
beer_ingredients_join <-
beer_ingredients_join[, (! grepl(more_unnecessary, names(beer_ingredients_join)) == TRUE)]
# Reorder columns a bit
beer_ingredients_join <- beer_ingredients_join %>%
select(
id, name, total_hops, total_malt, everything(), -description
)
# And get a df that includes total_hops and total_malt but not all the other ingredient columns
beer_totals <- beer_ingredients_join %>%
select(
id, name, total_hops, total_malt, style, style_collapsed,
abv, ibu, srm, glass, hops_name, malt_name
)
Now we’re left with something of a sparse matrix of all the ingredients compared to all the beers
kable(beer_ingredients_join[1:20, ])
Unsupervised Clustering
We K-means cluster beers based on certain numeric predictor variables.
Prep
- Write a funciton that takes a dataframe, a set of predictors, a response variable, and the number of cluster centers you want
- NB: There are not not very many beers have SRM so we may not want to omit based on it
- Take out missing values, and scale the data
- Take out outliers, defined as beers have to have an ABV between 3 and 20 and an IBU less than 200
- Then cluster on just the predictors and compare to the response variable
library(NbClust)
cluster_it <- function(df, preds, to_scale, resp, n_centers) {
df_for_clustering <- df %>%
select_(.dots = c(response_vars, cluster_on)) %>%
na.omit() %>%
filter(
abv < 20 & abv > 3
) %>%
filter(
ibu < 200
)
df_all_preds <- df_for_clustering %>%
select_(.dots = preds)
df_preds_scale <- df_all_preds %>%
select_(.dots = to_scale) %>%
rename(
abv_scaled = abv,
ibu_scaled = ibu,
srm_scaled = srm
) %>%
scale() %>%
as_tibble()
df_preds <- bind_cols(df_preds_scale, df_all_preds[, (!names(df_all_preds) %in% to_scale)])
df_outcome <- df_for_clustering %>%
select_(.dots = resp) %>%
na.omit()
set.seed(9)
clustered_df_out <- kmeans(x = df_preds, centers = n_centers, trace = TRUE)
clustered_df <- as_tibble(data.frame(
cluster_assignment = factor(clustered_df_out$cluster),
df_outcome, df_preds,
df_for_clustering %>% select(abv, ibu, srm)))
return(clustered_df)
}
Cluster
First we’ll run the fuction with 10 centers, and cluster on the predictors ABV, IBU, SRM, total_hops, and total_malt.
cluster_on <- c("abv", "ibu", "srm", "total_hops", "total_malt")
to_scale <- c("abv", "ibu", "srm", "total_hops", "total_malt")
response_vars <- c("name", "style", "style_collapsed")
clustered_beer <- cluster_it(df = beer_totals,
preds = cluster_on,
to_scale = to_scale,
resp = response_vars,
n_centers = 10)
KMNS(*, k=10): iter= 1, indx=5
QTRAN(): istep=4472, icoun=1
QTRAN(): istep=8944, icoun=84
QTRAN(): istep=13416, icoun=31
QTRAN(): istep=17888, icoun=310
QTRAN(): istep=22360, icoun=444
QTRAN(): istep=26832, icoun=3571
KMNS(*, k=10): iter= 2, indx=28
QTRAN(): istep=4472, icoun=3
QTRAN(): istep=8944, icoun=26
QTRAN(): istep=13416, icoun=0
QTRAN(): istep=17888, icoun=45
QTRAN(): istep=22360, icoun=37
QTRAN(): istep=26832, icoun=170
QTRAN(): istep=31304, icoun=45
QTRAN(): istep=35776, icoun=64
QTRAN(): istep=40248, icoun=692
KMNS(*, k=10): iter= 3, indx=6
QTRAN(): istep=4472, icoun=16
QTRAN(): istep=8944, icoun=36
QTRAN(): istep=13416, icoun=318
QTRAN(): istep=17888, icoun=1114
QTRAN(): istep=22360, icoun=1628
QTRAN(): istep=26832, icoun=1518
KMNS(*, k=10): iter= 4, indx=227
QTRAN(): istep=4472, icoun=560
QTRAN(): istep=8944, icoun=1426
QTRAN(): istep=13416, icoun=401
KMNS(*, k=10): iter= 5, indx=4472
Head of the clustering data
kable(clustered_beer[1:20, ])
A table of cluster counts broken down by style
cluster_table_counts <- table(style = clustered_beer$style_collapsed, cluster = clustered_beer$cluster_assignment)
kable(cluster_table_counts)
Plot the clusters. There are 3 axes: ABV, IBU, and SRM, so we choose two at a time.
clustered_beer_plot_abv_ibu <- ggplot(data = clustered_beer, aes(x = abv, y = ibu, colour = cluster_assignment)) +
geom_jitter() + theme_minimal() +
ggtitle("k-Means Clustering of Beer by ABV, IBU, SRM") +
labs(x = "ABV", y = "IBU") +
labs(colour = "Cluster Assignment")
clustered_beer_plot_abv_ibu

clustered_beer_plot_abv_srm <- ggplot(data = clustered_beer, aes(x = abv, y = srm, colour = cluster_assignment)) +
geom_jitter() + theme_minimal() +
ggtitle("k-Means Clustering of Beer by ABV, IBU, SRM") +
labs(x = "ABV", y = "SRM") +
labs(colour = "Cluster Assignment")
clustered_beer_plot_abv_srm

Now we can add in the style centers (means) for each style_collapsed and label it.
library(ggrepel)
abv_ibu_clusters_vs_style_centers <- ggplot() +
geom_point(data = clustered_beer,
aes(x = abv, y = ibu, colour = cluster_assignment), alpha = 0.5) +
geom_point(data = style_centers,
aes(mean_abv, mean_ibu), colour = "black") +
geom_text_repel(data = style_centers, aes(mean_abv, mean_ibu, label = style_collapsed),
box.padding = unit(0.45, "lines"),
family = "Calibri",
label.size = 0.3) +
ggtitle("Popular Styles vs. k-Means Clustering of Beer by ABV, IBU, SRM") +
labs(x = "ABV", y = "IBU") +
labs(colour = "Cluster Assignment") +
theme_bw()
abv_ibu_clusters_vs_style_centers

The clustering above used a smaller number of clusters (10) than there are styles_collapsed. That makes it difficult to determine whether a given style fits snugly into a cluster or not.
Cluster on just certain selected styles
We’ll take five very distinct collapsed styles and re-run the clustering on beers that fall into these categories. These styles were intentionally chosen because they are quite distinct: Blonde, IPA, Stout, Tripel, Wheat. Arguably, of these five styles Blondes and Wheats are the closest
styles_to_keep <- c("Blonde", "India Pale Ale", "Stout", "Tripel", "Wheat")
bt_certain_styles <- beer_totals %>%
filter(
style_collapsed %in% styles_to_keep
)
cluster_on <- c("abv", "ibu", "srm", "total_hops", "total_malt")
to_scale <- c("abv", "ibu", "srm", "total_hops", "total_malt")
response_vars <- c("name", "style", "style_collapsed")
certain_styles_clustered <- cluster_it(df = bt_certain_styles,
preds = cluster_on,
to_scale = to_scale,
resp = response_vars,
n_centers = 5)
KMNS(*, k=5): iter= 1, indx=4
QTRAN(): istep=1322, icoun=2
QTRAN(): istep=2644, icoun=20
QTRAN(): istep=3966, icoun=96
QTRAN(): istep=5288, icoun=0
QTRAN(): istep=6610, icoun=13
QTRAN(): istep=7932, icoun=285
KMNS(*, k=5): iter= 2, indx=13
QTRAN(): istep=1322, icoun=147
QTRAN(): istep=2644, icoun=138
QTRAN(): istep=3966, icoun=20
QTRAN(): istep=5288, icoun=599
KMNS(*, k=5): iter= 3, indx=20
QTRAN(): istep=1322, icoun=146
QTRAN(): istep=2644, icoun=617
KMNS(*, k=5): iter= 4, indx=1322
style_centers_certain_styles <- style_centers %>%
filter(style_collapsed %in% styles_to_keep)
Table of style vs. cluster.
kable(table(style = certain_styles_clustered$style_collapsed, cluster = certain_styles_clustered$cluster_assignment))
Now that we have a manageable number of styles, we can see how well fit each cluster is to each style. If the features we clustered on perfectly predicted style, there would each color (cluster) would be unique to each facet of the plot. (E.g., left entirely blue, second from left entirely green, etc.)
by_style_plot <- ggplot() +
geom_point(data = certain_styles_clustered,
aes(x = abv, y = ibu,
colour = cluster_assignment), alpha = 0.5) +
facet_grid(. ~ style_collapsed) +
geom_point(data = style_centers_certain_styles,
aes(mean_abv, mean_ibu), colour = "black", shape = 5) +
ggtitle("Selected Styles Cluster Assignment") +
labs(x = "ABV", y = "IBU") +
labs(colour = "Cluster") +
theme_bw()
by_style_plot
–>
Random asides into hops
Do more hops always mean more bitterness?
- It would appear so, from this graph and this regression (beta = 2.394418)
ggplot(data = beer_ingredients_join, aes(total_hops, ibu)) +
geom_point(aes(total_hops, ibu, colour = style_collapsed)) +
geom_smooth(method = lm, se = FALSE, colour = "black") +
ggtitle("Hops Per Beer vs. Bitterness") +
labs(x = "Number of Hops", y = "IBU", colour = "Style Collapsed") +
theme_minimal()
hops_ibu_lm <- lm(ibu ~ total_hops, data = beer_ingredients_join)
summary(hops_ibu_lm)
- However, past a certain point (3 hops or more), there’s no effect of number of hops on IBU
ggplot(data = beer_ingredients_join[which(beer_ingredients_join$total_hops > 2
& beer_ingredients_join$total_hops < 8), ], aes(total_hops, ibu)) +
geom_point(aes(total_hops, ibu, colour = style_collapsed)) +
geom_smooth(method = lm, se = FALSE, colour = "black") +
ggtitle("3+ Hops Per Beer vs. Bitterness") +
labs(x = "Number of Hops", y = "IBU", colour = "Style Collapsed") +
theme_minimal()
Most popular hops
# Gather up all the hops columns into one called `hop_name`
beer_necessities_hops_gathered <- beer_necessities %>%
gather(
hop_key, hop_name, hops_name_1:hops_name_13
) %>% as_tibble()
# Filter to just those beers that have at least one hop
beer_necessities_w_hops <- beer_necessities_hops_gathered %>%
filter(!is.na(hop_name)) %>%
filter(!hop_name == "")
beer_necessities_w_hops$hop_name <- factor(beer_necessities_w_hops$hop_name)
# For all hops, find the number of beers they're in as well as those beers' mean IBU and ABV
hops_beer_stats <- beer_necessities_w_hops %>%
ungroup() %>%
group_by(hop_name) %>%
summarise(
mean_ibu = mean(ibu, na.rm = TRUE),
mean_abv = mean(abv, na.rm = TRUE),
n = n()
)
# Pare to hops that are used in at least 50 beers
pop_hops_beer_stats <- hops_beer_stats[hops_beer_stats$n > 50, ]
kable(pop_hops_beer_stats)
# Keep just beers that contain these most popular hops
beer_necessities_w_popular_hops <- beer_necessities_w_hops %>%
filter(hop_name %in% pop_hops_beer_stats$hop_name) %>%
droplevels()
Are there certian hops that are used more often in very high IBU or ABV beers? Hard to detect a pattern
ggplot(data = beer_necessities_w_popular_hops) +
geom_point(aes(abv, ibu, colour = hop_name)) +
ggtitle("Beers Containing most Popular Hops") +
labs(x = "ABV", y = "IBU", colour = "Hop Name") +
theme_minimal()
ggplot(data = pop_hops_beer_stats) +
geom_point(aes(mean_abv, mean_ibu, colour = hop_name, size = n)) +
ggtitle("Most Popular Hops' Effect on Alcohol and Bitterness") +
labs(x = "Mean ABV per Hop Type", y = "Mean IBU per Hop Type", colour = "Hop Name",
size = "Number of Beers") +
theme_minimal()
Neural Net
- Can ABV, IBU, and SRM be used in a neural net to predict
style or style_collapsed?
- In the function, specify the dataframe and the outcome, either
style or style_collapsed; the one not specified as outcome will be dropped
- The predictor columns will be everything not specified in the vector
predictor_vars
- The function returns the outcome variable sleected, neural net output, variable importance, the prediction dataframe, predictions, and accuracy
library(nnet)
library(caret)
run_neural_net <- function(df, outcome, predictor_vars) {
out <- list(outcome = outcome)
# Create a new column outcome; it's style_collapsed if you set outcome to style_collapsed, and style otherwise
if (outcome == "style_collapsed") {
df[["outcome"]] <- df[["style_collapsed"]]
} else {
df[["outcome"]] <- df[["style"]]
}
df$outcome <- factor(df$outcome)
cols_to_keep <- c("outcome", predictor_vars)
df <- df %>%
select_(.dots = cols_to_keep) %>%
mutate(row = 1:nrow(df)) %>%
droplevels()
# Select 80% of the data for training
df_train <- sample_n(df, nrow(df)*(0.8))
# The rest is for testing
df_test <- df %>%
filter(! (row %in% df_train$row)) %>%
select(-row)
df_train <- df_train %>%
select(-row)
# Build multinomail neural net
nn <- multinom(outcome ~ .,
data = df_train, maxit=500, trace=T)
# Which variables are the most important in the neural net?
most_important_vars <- varImp(nn)
# How accurate is the model? Compare predictions to outcomes from test data
nn_preds <- predict(nn, type="class", newdata = df_test)
nn_accuracy <- postResample(df_test$outcome, nn_preds)
out <- list(out, nn = nn, most_important_vars = most_important_vars,
df_test = df_test,
nn_preds = nn_preds,
nn_accuracy = nn_accuracy)
return(out)
}
- Set the dataframe to be
beer_totals, the predictor variables to be the vector contained in p_vars, the outcome to be style_collapsed
p_vars <- c("total_hops", "total_malt", "abv", "ibu", "srm", "glass")
nn_collapsed_out <- run_neural_net(df = beer_totals, outcome = "style_collapsed",
predictor_vars = p_vars)
Error in nnet.default(X, Y, w, mask = mask, size = 0, skip = TRUE, softmax = TRUE, :
too many (1422) weights
- What if we predcit
style instead of style_collapsed?
nn_notcollapsed_out <- run_neural_net(df = beer_totals, outcome = "style",
predictor_vars = p_vars)
nn_notcollapsed_out$nn_accuracy
nn_notcollapsed_out$most_important_vars
And now if we add glass as a predictor?
p_vars_add_glass <- c("total_hops", "total_malt", "abv", "ibu", "srm")
nn_collapsed_out_add_glass <- run_neural_net(df = beer_ingredients_join, outcome = "style_collapsed",
predictor_vars = p_vars_no_glass)
nn_collapsed_out_add_glass$nn_accuracy
nn_collapsed_out_add_glass$most_important_vars
Random forest with all ingredients
- We can use a random forest to get even more granular with ingredients
- The sparse ingredient dataframe was too complex for the multinomial neural net; however, we can
- Here we don’t include
glass as a predictor
library(ranger)
library(stringr)
bi <- beer_ingredients_join %>%
select(-c(id, name, cluster_assignment, style, hops_name, malt_name,
glass)) %>%
mutate(row = 1:nrow(.))
bi$style_collapsed <- factor(bi$style_collapsed)
# ranger complains about special characters and spaces in ingredient column names. Take them out and replace with empty string.
names(bi) <- tolower(names(bi))
names(bi) <- str_replace_all(names(bi), " ", "")
names(bi) <- str_replace_all(names(bi), "([\\(\\)-\\/')]+)", "")
# Keep 80% for training
bi_train <- sample_n(bi, nrow(bi)*(0.8))
# The rest is for testing
bi_test <- bi %>%
filter(! (row %in% bi_train$row)) %>%
dplyr::select(-row)
bi_train <- bi_train %>%
dplyr::select(-row)
bi_rf <- ranger(style_collapsed ~ ., data = bi_train, importance = "impurity", seed = 11)
OOB (out of bag) prediction error is around 58% * This calculated from tree samples constructed but not used in training set; these trees become effectively part of test set
bi_rf
We can compare predicted classification on the test set to their actual style classification.
pred_bi_rf <- predict(bi_rf, dat = bi_test)
table(bi_test$style_collapsed, pred_bi_rf$predictions)
Variable importance * Interestingly, ABV, IBU, and SRM are all much more important in the random forest than total_hops and total_malt
importance(bi_rf)[1:10]
How does a CSRF (case-specific random forest) fare?
bi_csrf <- csrf(style_collapsed ~ ., training_data = bi_train, test_data = bi_test,
params1 = list(num.trees = 5, mtry = 4),
params2 = list(num.trees = 2))
csrf_acc <- postResample(bi_csrf, bi_test$style_collapsed)
csrf_acc
Final Thoughts
Style first, forgiveness later?
- One reason seems that beers are generally brewed with style in mind first (“let’s make a pale ale”) rather than deciding the beer’s style after determining its characteristics and idiosyncrasies
- Even if the beer turns out more like a sour, and in a blind taste test might be classified as a sour more often than a pale ale, it still gets the label pale ale
- This makes the style definitions broader and harder to predict
Future Directions
- Incorporate flavor profiles for beers sourced/scraped from somewhere
- Implement a GAN to come up with beer names
- More on the hops deep dive: which hops are used most often in which styles?
---
title: Data Science Musings on Beer
author:
  name: Amanda Dobbyn
date: '`r format(Sys.time(), "%B %d, %Y")`'
output:
  html_notebook:
    toc: false
    theme: yeti
  pdf_document:
    toc: false
  github_document:
    toc: true
    
# output:
#   html_document:
#     keep_md: true
#     toc: false
#     theme: yeti
#   github_document:
#     toc: true
---



```{r setup, include=FALSE}
library(knitr)

# knitr::opts_knit$set(root.dir=normalizePath('../'))
knitr::opts_chunk$set(echo = FALSE, warning = FALSE)
knitr::opts_chunk$set(fig.width=12, fig.height=8) 
options(knitr.table.format = 'markdown')

# source("./read_from_db.R")
source("./most_popular_styles.R")
# # source("./cluster.R")
# # source("./plot.R")
# source("./get_ingredient_levels.R")
```

This is a first pass exploration of different aspects of beer.

* Data courtesy of [BreweryDB](http://www.brewerydb.com/developers)
    * Special thanks to [Kris Kroski](https://kro.ski/) for data ideation and beer
    
    
*** 


* The main question on the table is this:
    * Are beer styles actually indicative of shared attributes of the beers within that style? Or are style boundaries more or less arbitrary?
      * Two approaches: clustering and prediction 
      * Clustering: are there natural clusters across the spectum of beers that align well with the styles they're grouped into? 
          * Unsupervised (k-means) clustering based on 
              * ABV (alcohol by volume), IBU (international bitterness units), SRM (measure of color)
              * How well do these match up with various "style centers," defined by mean of ABV, IBU, and SRM per beer style
      * Prediction: can we predict a beer's style based on certain characteristics of the beer?
          * Neural net 
          * Random forest
      
* Answer thus far
    * Beer-intrinsic attributes aren't great predictors of style
    * Style-defined attributes are the best predictors
        * For instance, the glass a beer is served in (which is defined by its style) is a much better predictor of its style than actual characteristics of the beer like ABV and even the number of different types of hops it contains

![](./taps.jpg)



### Workflow Overview

* Hit the BreweryDB API to iteratively pull in all beers and their ingredients along with other things we might want like breweries and glassware
* Unnest the JSON responses, including all the ingredients columns, and 
* Dump this all into a MySQL database 

* Create a `style_collapsed` column to reduce the number of levels of our outcome variable
    * `grep` through each beer's style to determine if that style contains a keyword that qualifies it to be rolled into a collapsed style
    * If it does, it gets that keyword in a `style_collapsed` column 
    * Further collpase styles that are similar like Hefeweizen and Wit into Wheat
    
* Unnest the ingredients `hops` and `malts` into a sparse matrix
    * Individual ingredients as columns, beers as rows; cell gets a 1 if ingredient is present and 0 otherwise 
    
* Cluster: unsupervised k-means clsutering based on ABV, IBU, and SRM

* Run a neural net
    * Predict either `style` or `style_collapsed` from all the predictors including the total number of hops and malts per beer


**Short Aside**

The question of what should be a predictor variable for style is a bit murky here. What should be fair game for predicting style and what shouldn't? Characteristics of a beer that are defined *by* its style would seem to be "cheating" in a way. 

* Main candidates are:
    * ABV (alcohol by volume), IBU (international bitterness units), SRM (standard reference measure, a scale of beer color from light to dark)
        * These are outputs of a beer that meaningfully define the beer and are theoretically orthogonal to each other
    * Ingredients in a beer such as hops and malts
        * Inputs to a beer that have some effect on its flavor profile
        * Semi-cheating because if style is determined beforehand it likely determines at least in part which ingredients are added 
    * Glass type
        * This is defined entirely by style and is very predictive of it



**This document compiled by querying the beer database I built, specifically by sourcing the file read_from_db.R. This is done for expediency's sake, (the code below detailing how to get the beer, run in full in run_it.R, takes some time to execute).** 


### Get and Prepare Data

**Getting beer, the age-old dilemma**

* The BreweryDB API returns a certain number of results per page; if we want 
* So, we hit the BreweryDB API and ask for `1:number_of_pages`
    * We can change `number_of_pages` to, e.g., 3 if we only want the first 3 pages
    * If there's only one page (as is the case for the glassware endpoing), numberOfPages won't be returned, so in this case we set number_of_pages to 1
* The `addition` parameter can be an empty string if nothing else is needed

```{r, eval = FALSE, echo=TRUE}

base_url <- "http://api.brewerydb.com/v2"
key_preface <- "/?key="

paginated_request <- function(ep, addition) {    
  full_request <- NULL
  first_page <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
                                , "&p=1"))
  number_of_pages <- ifelse(!(is.null(first_page$numberOfPages)), 
                            first_page$numberOfPages, 1)      

    for (page in 1:number_of_pages) {                               
    this_request <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
                                    , "&p=", page, addition),
                             flatten = TRUE) 
    this_req_unnested <- unnest_it(this_request)    #  <- request unnested here
    print(this_req_unnested$currentPage)
    full_request <- bind_rows(full_request, this_req_unnested[["data"]])
  }
  full_request
} 

all_beer_raw <- paginated_request("beers", "&withIngredients=Y")
```



* Function for unnesting JSON used inside `paginated_request()` below
    + Takes the column named `name` nested within a column in the data portion of the response
        + If the `name` column doesn't exist, it takes the first nested column
* We use something similar to unnest ingredient like all of a beer's hops and malts into a long string contained in `hops_name` and `malt_name`

```{r, eval=FALSE, echo=TRUE}
unnest_it <- function(df) {
  unnested <- df
  for(col in seq_along(df[["data"]])) {
    if(! is.null(ncol(df[["data"]][[col]]))) {
      if(! is.null(df[["data"]][[col]][["name"]])) {
        unnested[["data"]][[col]] <- df[["data"]][[col]][["name"]]
      } else {
        unnested[["data"]][[col]] <- df[["data"]][[col]][[1]]
      }
    }
  }
  unnested
}
```



**Collapse Styles**

* Save the most popular styles in `keywords`
* Loop through each keyword
    * For each beer, `grep` through its style column to see if it contains any one of these keywords
    * If it does, give it that keyword in a new column `style_collapsed`
* If a beer's name matches multiple keywords, e.g., American Double India Pale Ale would match Double India Pale Ale, India Pale Ale, and Pale Ale, its `style_collapsed` is the **last** of those that appear in keywords 
    * This is why keywords are intentionally ordered from most general to most specific
    * So in the case of an case of American Double India Pale Ale: since Double India Pale Ale appears in `keywords` after India Pale Ale and Pale Ale, an American Double India Pale Ale would get a `style_collapsed` of Double India Pale Ale
* If no keyword is contained in `style`, `style_collapsed` is just whatever's in `style`; in other words, it doesn't get collpsed into a bigger bucket
    * This isn't a huge problem because we'll pare down to just the most popular styles later, however we could think about creating a catchall "Other" level for `style_collapsed`

```{r, eval=FALSE, echo=TRUE}

collapse_styles <- function(df) {
  keywords <- c("Lager", "Pale Ale", "India Pale Ale", "Double India Pale Ale", "India Pale Lager", "Hefeweizen", "Barrel-Aged","Wheat", "Pilsner", "Pilsener", "Amber", "Golden", "Blonde", "Brown", "Black", "Stout", "Porter", "Red", "Sour", "Kölsch", "Tripel", "Bitter", "Saison", "Strong Ale", "Barley Wine", "Dubbel", "Altbier")
  
  for (beer in 1:nrow(df)) {
    if (grepl(paste(keywords, collapse="|"), df$style[beer])) {    
      for (keyword in keywords) {         
        if(grepl(keyword, df$style[beer]) == TRUE) {
          df$style_collapsed[beer] <- keyword    
        }                         
      } 
    } else {
      df$style_collapsed[beer] <- as.character(df$style[beer])       
    }
    print(df$style_collapsed[beer])
  }
  return(df)
}

```

* Then we collapse further; right now we just combine all wheaty bears into Wheat and Pils-like beers into Pilsener (with two e's) by `fct_collapse`ing those levels

```{r, echo=TRUE, eval=FALSE}
collapse_further <- function(df) {
  df[["style_collapsed"]] <- df[["style_collapsed"]] %>%
    fct_collapse(
      "Wheat" = c("Hefeweizen", "Wheat"),
      "Pilsener" = c("Pilsner", "American-Style Pilsener") # pilsener == pilsner == pils
    )
  return(df)
}
```



**Split out Ingredients**

* When we unnested ingredients, we just concatenated all of the ingredients for a given beer into a long string
* If we want, we can split out the ingredients that were concatenated in `<ingredient>_name` with this `split_ingredients` function
* This takes a vector of `ingredients_to_split`, so e.g. `c("hops_name", "malt_name")` and creates one column for each type of ingredient (`hops_name_1`, `hops_name_2`, etc.)

* We `str_split` on the ingredient and get a list back
* We find the max number of instances of an ingredient per beer, which will be the number of columns we're adding
* For each new column we need, we create it, initialize it with NAs, and name it
* Then for each element in our list of split up ingredients, if it exists, we add it to the correct column in our df

```{r, eval=FALSE, echo=TRUE}
split_ingredients <- function(df, ingredients_to_split) {
  
  ncol_df <- ncol(df)
  
  for (ingredient in ingredients_to_split) {

    ingredient_split <- str_split(df[[ingredient]], ", ")    
    num_new_cols <- max(lengths(ingredient_split))    
  
    for (num in 1:num_new_cols) {
      
      this_col <- ncol_df + 1         
      
      df[, this_col] <- NA
      names(df)[this_col] <- paste0(ingredient, "_", num)
      ncol_df <- ncol(df)             
      for (row in seq_along(ingredient_split)) {          
        if (!is.null(ingredient_split[[row]][num])) {        
          df[row, this_col] <- ingredient_split[[row]][num]
        }
      }
      df[[names(df)[this_col]]] <- factor(df[[names(df)[this_col]]])
    }
    
    ncol_df <- ncol(df)
  }
  return(df)
}
```



**Find the Most Popualar Styles**

* Find mean ABV, IBU, and SRM per collapsed style
* Arrange collapsed styles by the number of beers that fall into them
    * This is of course dependent on how we collapse styles

```{r, eval=FALSE, echo=TRUE}
library(forcats)

# Pare down to only cases where style is not NA
beer_dat <- beer_necessities

beer_dat_pared <- beer_dat[complete.cases(beer_dat$style), ]

# Arrange beer dat by style popularity
style_popularity <- beer_dat_pared %>% 
  group_by(style) %>% 
  count() %>% 
  arrange(desc(n))
style_popularity

# Add a column that scales popularity
style_popularity <- bind_cols(style_popularity, 
                               n_scaled = as.vector(scale(style_popularity$n)))

# Find styles that are above a z-score of 0
popular_styles <- style_popularity %>% 
  filter(n_scaled > 0)

# Pare dat down to only beers that fall into those styles
popular_beer_dat <- beer_dat_pared %>% 
  filter(
    style %in% popular_styles$style
  ) %>% 
  droplevels() %>% 
  as_tibble() 
nrow(popular_beer_dat)

# Find the centers (mean abv, ibu, srm) of the most popular styles
style_centers <- popular_beer_dat %>% 
  group_by(style_collapsed) %>% 
  add_count() %>% 
  summarise(
    mean_abv = mean(abv, na.rm = TRUE),
    mean_ibu = mean(ibu, na.rm = TRUE), 
    mean_srm = mean(srm, na.rm = TRUE),
    n = median(n, na.rm = TRUE)          # Median here only for summarise. Should be just the same as n
  ) %>% 
  arrange(desc(n)) %>% 
  drop_na() %>% 
  droplevels()

# Give some nicer names
style_centers_rename <- style_centers %>% 
  rename(
    `Collapsed Style` = style_collapsed,
    `Mean ABV` = mean_abv,
    `Mean IBU` = mean_ibu,
    `Mean SRM` = mean_srm,
    `Numer of Beers` = n
  )
```


Take a look at the table      

```{r}
kable(style_centers_rename)
```


***

Now that the munging is done, onto the main question: do natural clusters in beer align with style boundaries?




***

### Ingredients

To get more granular with ingredients, we can split out each individual ingredient into its own column. If a beer or style contains that ingredient, its row gets a 1 in that ingredient column and a 0 otherwise.

From this, we can find the total number of hops and malts per grouper.

* The dataframe we'll use will be `beer_necessities`


<!-- ```{r, eval=TRUE, echo=TRUE} -->
<!-- clustered_beer_necessities <- clustered_beer %>% -->
<!--   inner_join(beer_necessities) -->

<!-- ``` -->

* This function takes a dataframe and two other parameters set at the outset:
    * `ingredient_want`: this can be `hops`, `malt`, or other ingredients like `yeast` if we pull that in
    * `grouper`: can be a vector of one or more things to group by, like beer `name` or `style`

```{r, eval=TRUE, echo=TRUE}

pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
  
  # ----------------------- Setup --------------------------- #
  # We've already split ingredient number names out from the concatenated string into columns like `malt_name_1`,
  # `malt_name_2`, etc. We need to find the range of these columns; there will be a different number of malt
  # columns than hops columns, for instance. The first one will be `<ingredient>_name_1` and from this we can find
  # the index of this column in our dataframe. We get the name of last one with the `get_last_ing_name_col()`
  # function. Then we save a vector of all the ingredient column names in `ingredient_colnames`. It will stay
  # constant even if the indices change when we select out certain columns. 
  
  # First ingredient
  first_ingredient_name <- paste(ingredient_want, "_name_1", sep="")
  first_ingredient_index <- which(colnames(df)==first_ingredient_name)
  
  # Get the last ingredient
  get_last_ing_name_col <- function(df) {
    for (col in names(df)) {
      if (grepl(paste(ingredient_want, "_name_", sep = ""), col) == TRUE) {
        name_last_ing_col <- col
      }
    }
    return(name_last_ing_col)
  }
  
  # Last ingredient
  last_ingredient_name <- get_last_ing_name_col(df)
  last_ingredient_index <- which(colnames(df)==last_ingredient_name)
  
  # Vector of all the ingredient column names
  ingredient_colnames <- names(df)[first_ingredient_index:last_ingredient_index]
  
  # Non-ingredient column names we want to keep
  to_keep_col_names <- c("cluster_assignment", "name", "abv", "ibu", "srm", "style", "style_collapsed")
  
  # -------------------------------------------------------------------------------# 
  
  # Inside `gather_ingredients()` we take out superflous column names that are not in `to_keep_col_names` or one 
  # of the ingredient columns, find what the new ingredient column indices are, since they'll have changed after 
  # we pared down and then gather all of the ingredient columns (e.g., `hops_name_1`) into one long column, 
  # `ing_keys` and all the actual ingredient names (e.g., Cascade) into `ing_names`.
  
  # ----------------------------- Gather columns --------------------------------- #
  gather_ingredients <- function(df, cols_to_gather) {
    to_keep_indices <- which(colnames(df) %in% to_keep_col_names)
    
    selected_df <- df[, c(to_keep_indices, first_ingredient_index:last_ingredient_index)]
    
    new_ing_indices <- which(colnames(selected_df) %in% cols_to_gather)    # indices will have changed since we pared down 
    
    df_gathered <- selected_df %>%
      gather_(
        key_col = "ing_keys",
        value_col = "ing_names",
        gather_cols = colnames(selected_df)[new_ing_indices]
      ) %>%
      mutate(
        count = 1
      )
    df_gathered
  }
  beer_gathered <- gather_ingredients(df, ingredient_colnames)  # ingredient colnames defined above function
  # ------------------------------------------------------------------------------- # 
  
  # Next we get a vector of all ingredient levels and take out the one that's an empty string and 
  # use this vector of ingredient levels in `select_spread_cols()` below

  # Get a vector of all ingredient levels
  beer_gathered$ing_names <- factor(beer_gathered$ing_names)
  ingredient_levels <- levels(beer_gathered$ing_names) 
  
  # Take out the level that's just an empty string
  to_keep_levels <- !(c(1:length(ingredient_levels)) %in% which(ingredient_levels == ""))
  ingredient_levels <- ingredient_levels[to_keep_levels]
  
  beer_gathered$ing_names <- as.character(beer_gathered$ing_names)
  
  # ----------------------------------------------------------------------------- # 
  
  # Then we spread the ingredient names: we take what was previously the `value` in our gathered dataframe, the
  # actual ingredient names (Cascade, Centennial) and make that our `key`; it'll form the new column names. The
  # new `value` is `value` is count; it'll populate the row cells. If a given row has a certain ingredient, it
  # gets a 1 in the corresponding cell, an NA otherwise. 
  # We add a unique idenfitier for each row with `row`, which we'll drop later (see [Hadley's SO
  # comment](https://stackoverflow.com/questions/25960394/unexpected-behavior-with-tidyr)).

  
  # ------------------------------- Spread columns -------------------------------- #
  spread_ingredients <- function(df) {
    df_spread <- df %>% 
      mutate(
        row = 1:nrow(df)        # Add a unique idenfitier for each row which we'll need in order to spread; we'll drop this later
      ) %>%                                 
      spread(
        key = ing_names,
        value = count
      ) 
    return(df_spread)
  }
  beer_spread <- spread_ingredients(beer_gathered)
  # ------------------------------------------------------------------------------- # 

  
  # ------------------------- Select only certain columns ------------------------- #
  select_spread_cols <- function(df) {
    to_keep_col_indices <- which(colnames(df) %in% to_keep_col_names)
    to_keep_ingredient_indices <- which(colnames(df) %in% ingredient_levels)
    
    to_keep_inds_all <- c(to_keep_col_indices, to_keep_ingredient_indices)
    
    new_df <- df %>% 
      select_(
        .dots = to_keep_inds_all
      )
    return(new_df)
  }
  beer_spread_selected <- select_spread_cols(beer_spread)
  # ------------------------------------------------------------------------------- # 

  # Take out all rows that have no ingredients specified at all
  inds_to_remove <- apply(beer_spread_selected[, first_ingredient_index:last_ingredient_index], 
                          1, function(x) all(is.na(x)))
  beer_spread_no_na <- beer_spread_selected[ !inds_to_remove, ]
  
  
  # ----------------- Group ingredients by the grouper specified ------------------- #
  # Then we do the final step and group by the groupers.
  
  get_ingredients_per_grouper <- function(df, grouper = grouper) {
    df_grouped <- df %>%
      ungroup() %>% 
      group_by_(grouper)
    
    not_for_summing <- which(colnames(df_grouped) %in% to_keep_col_names)
    max_not_for_summing <- max(not_for_summing)
    
    per_grouper <- df_grouped %>% 
      select(-c(abv, ibu, srm)) %>%    # taking out temporarily
      summarise_if(
        is.numeric,              
        sum, na.rm = TRUE
        # -c(abv, ibu, srm)
      ) %>%
      mutate(
        total = rowSums(.[(max_not_for_summing + 1):ncol(.)], na.rm = TRUE)    
      )
    
    # Send total to the second position
    per_grouper <- per_grouper %>% 
      select(
        name, total, everything()
      )
    
    # Replace total column with more descriptive name: total_<ingredient>
    names(per_grouper)[which(names(per_grouper) == "total")] <- paste0("total_", ingredient_want)
    
    return(per_grouper)
  }
  # ------------------------------------------------------------------------------- # 
  
  ingredients_per_grouper <- get_ingredients_per_grouper(beer_spread_selected, grouper)
  return(ingredients_per_grouper)
}
```


* Now run the function with `ingredient_want` as first hops, then malt
* Then join the resulting dataframes and remove/reorder some columns

```{r, echo=TRUE, eval=TRUE}
# Run the entire function with ingredient_want set to hops, grouping by name
ingredients_per_beer_hops <- pick_ingredient_get_beer(ingredient_want = "hops", 
                                                      beer_necessities, 
                                                      grouper = c("name", "style_collapsed"))

# Same for malt
ingredients_per_beer_malt <- pick_ingredient_get_beer(ingredient_want = "malt", 
                                                      beer_necessities, 
                                                      grouper = c("name", "style_collapsed"))

# Join those on our original dataframe by name
beer_ingredients_join_first_ingredient <- left_join(beer_necessities, ingredients_per_beer_hops,
                                                    by = "name")
beer_ingredients_join <- left_join(beer_ingredients_join_first_ingredient, ingredients_per_beer_malt,
                                   by = "name")


# Take out some unnecessary columns
unnecessary_cols <- c("styleId", "abv_scaled", "ibu_scaled", "srm_scaled", 
                      "hops_id", "malt_id", "glasswareId", "style.categoryId")
beer_ingredients_join <- beer_ingredients_join[, (! names(beer_ingredients_join) %in% unnecessary_cols)]


# If we also want to take out any of the malt_name_1, malt_name_2, etc. columns we can do this with a grep
more_unnecessary <- c("hops_name_|malt_name_")
beer_ingredients_join <- 
  beer_ingredients_join[, (! grepl(more_unnecessary, names(beer_ingredients_join)) == TRUE)]

# Reorder columns a bit
beer_ingredients_join <- beer_ingredients_join %>% 
  select(
    id, name, total_hops, total_malt, everything(), -description
  )

# And get a df that includes total_hops and total_malt but not all the other ingredient columns
beer_totals <- beer_ingredients_join %>% 
  select(
    id, name, total_hops, total_malt, style, style_collapsed,
    abv, ibu, srm, glass, hops_name, malt_name
  )

```



Now we're left with something of a sparse matrix of all the ingredients compared to all the beers
```{r}
kable(beer_ingredients_join[1:20, ])
```




***

### Unsupervised Clustering 
We K-means cluster beers based on certain numeric predictor variables. 


**Prep**

* Write a funciton that takes a dataframe, a set of predictors, a response variable, and the number of cluster centers you want
    * NB: There are not not very many beers have SRM so we may not want to omit based on it

* Take out missing values, and scale the data
* Take out outliers, defined as beers have to have an ABV between 3 and 20 and an IBU less than 200
* Then cluster on just the predictors and compare to the response variable
  

```{r, echo=TRUE}

library(NbClust)

cluster_it <- function(df, preds, to_scale, resp, n_centers) {
  df_for_clustering <- df %>%
    select_(.dots = c(response_vars, cluster_on)) %>%
    na.omit() %>%
    filter(
      abv < 20 & abv > 3
    ) %>%
    filter(
      ibu < 200
    )

  df_all_preds <- df_for_clustering %>%
    select_(.dots = preds)

  df_preds_scale <- df_all_preds %>%
    select_(.dots = to_scale) %>%
    rename(
      abv_scaled = abv,
      ibu_scaled = ibu,
      srm_scaled = srm
    ) %>%
    scale() %>%
    as_tibble()

  df_preds <- bind_cols(df_preds_scale, df_all_preds[, (!names(df_all_preds) %in% to_scale)])

  df_outcome <- df_for_clustering %>%
    select_(.dots = resp) %>%
    na.omit()

  set.seed(9)
  clustered_df_out <- kmeans(x = df_preds, centers = n_centers, trace = TRUE)

  clustered_df <- as_tibble(data.frame(
    cluster_assignment = factor(clustered_df_out$cluster),
    df_outcome, df_preds,
    df_for_clustering %>% select(abv, ibu, srm)))

  return(clustered_df)
}

```



 
**Cluster**

First we'll run the fuction with 10 centers, and cluster on the predictors ABV, IBU, SRM, total_hops, and total_malt.


```{r, echo=TRUE}

cluster_on <- c("abv", "ibu", "srm", "total_hops", "total_malt")
to_scale <- c("abv", "ibu", "srm", "total_hops", "total_malt")
response_vars <- c("name", "style", "style_collapsed")

clustered_beer <- cluster_it(df = beer_totals,
                             preds = cluster_on,
                             to_scale = to_scale,
                             resp = response_vars,
                             n_centers = 10)
```


Head of the clustering data
```{r}
kable(clustered_beer[1:20, ])
```



A table of cluster counts broken down by style
```{r}
cluster_table_counts <- table(style = clustered_beer$style_collapsed, cluster = clustered_beer$cluster_assignment)

kable(cluster_table_counts)
```


Plot the clusters. There are 3 axes: ABV, IBU, and SRM, so we choose two at a time. 

```{r, echo=TRUE}
clustered_beer_plot_abv_ibu <- ggplot(data = clustered_beer, aes(x = abv, y = ibu, colour = cluster_assignment)) + 
  geom_jitter() + theme_minimal()  +
  ggtitle("k-Means Clustering of Beer by ABV, IBU, SRM") +
  labs(x = "ABV", y = "IBU") +
  labs(colour = "Cluster Assignment")
clustered_beer_plot_abv_ibu

clustered_beer_plot_abv_srm <- ggplot(data = clustered_beer, aes(x = abv, y = srm, colour = cluster_assignment)) + 
  geom_jitter() + theme_minimal()  +
  ggtitle("k-Means Clustering of Beer by ABV, IBU, SRM") +
  labs(x = "ABV", y = "SRM") +
  labs(colour = "Cluster Assignment")
clustered_beer_plot_abv_srm
```



Now we can add in the style centers (means) for each `style_collapsed` and label it.

```{r, echo=TRUE}
library(ggrepel)
abv_ibu_clusters_vs_style_centers <- ggplot() +   
  geom_point(data = clustered_beer, 
             aes(x = abv, y = ibu, colour = cluster_assignment), alpha = 0.5) +
  geom_point(data = style_centers,
             aes(mean_abv, mean_ibu), colour = "black") +
  geom_text_repel(data = style_centers, aes(mean_abv, mean_ibu, label = style_collapsed), 
                  box.padding = unit(0.45, "lines"),
                  family = "Calibri",
                  label.size = 0.3) +
  ggtitle("Popular Styles vs. k-Means Clustering of Beer by ABV, IBU, SRM") +
  labs(x = "ABV", y = "IBU") +
  labs(colour = "Cluster Assignment") +
  theme_bw()
abv_ibu_clusters_vs_style_centers
```


The clustering above used a smaller number of clusters (10) than there are `styles_collapsed`. That makes it difficult to determine whether a given style fits snugly into a cluster or not.



**Cluster on just certain selected styles**

We'll take five very distinct collapsed styles and re-run the clustering on beers that fall into these categories. 
These styles were intentionally chosen because they are quite distinct: Blonde, IPA, Stout, Tripel, Wheat. Arguably, of these five styles Blondes and Wheats are the closest



```{r, echo=TRUE}
styles_to_keep <- c("Blonde", "India Pale Ale", "Stout", "Tripel", "Wheat")
bt_certain_styles <- beer_totals %>%
  filter(
    style_collapsed %in% styles_to_keep
  )


cluster_on <- c("abv", "ibu", "srm", "total_hops", "total_malt")
to_scale <- c("abv", "ibu", "srm", "total_hops", "total_malt")
response_vars <- c("name", "style", "style_collapsed")

certain_styles_clustered <- cluster_it(df = bt_certain_styles,
                                 preds = cluster_on,
                                 to_scale = to_scale,
                                 resp = response_vars,
                                 n_centers = 5)

style_centers_certain_styles <- style_centers %>% 
  filter(style_collapsed %in% styles_to_keep)
```





Table of style vs. cluster.
```{r}
kable(table(style = certain_styles_clustered$style_collapsed, cluster = certain_styles_clustered$cluster_assignment))
```



Now that we have a manageable number of styles, we can see how well fit each cluster is to each style. If the features we clustered on perfectly predicted style, there would each color (cluster) would be unique to each facet of the plot. (E.g., left entirely blue, second from left entirely green, etc.)



```{r, echo=TRUE}
by_style_plot <- ggplot() +   
  geom_point(data = certain_styles_clustered, 
             aes(x = abv, y = ibu,
                 colour = cluster_assignment), alpha = 0.5) +
  facet_grid(. ~ style_collapsed) +
  geom_point(data = style_centers_certain_styles,
           aes(mean_abv, mean_ibu), colour = "black", shape = 5) +
  ggtitle("Selected Styles Cluster Assignment") +
  labs(x = "ABV", y = "IBU") +
  labs(colour = "Cluster") +
  theme_bw()
by_style_plot
```




<!-- ### Back to clustering: cluster on only 5 styles -->

<!-- * We'll pare down the beer data to just beers in 5 selected styles; 
<!-- * We'll cluster these into 5 clusters -->
<!-- <!-- * This time we'll add in `total_hops` and `total_malt` as predictors  --> -->




<!-- ggplot() + -->
<!--   geom_point(data = certain_styles_clustered, -->
<!--              aes(x = abv, y = ibu, -->
<!--                  shape = cluster_assignment, -->
<!--                  colour = style_collapsed), alpha = 0.5) + -->
<!--   geom_point(data = style_centers_certain_styles, -->
<!--              aes(mean_abv, mean_ibu), colour = "black") + -->
<!--   geom_text_repel(data = style_centers_certain_styles, -->
<!--                   aes(mean_abv, mean_ibu, label = style_collapsed), -->
<!--                   box.padding = unit(0.45, "lines"), -->
<!--                   family = "Calibri", -->
<!--                   label.size = 0.3) + -->
<!--   ggtitle("Selected Styles (colors) matched with Cluster Assignments (shapes)") + -->
<!--   labs(x = "ABV", y = "IBU") + -->
<!--   labs(colour = "Style", shape = "Cluster Assignment") + -->
<!--   theme_bw() -->

<!-- ``` -->



## Random asides into hops

**Do more hops always mean more bitterness?**

* It would appear so, from this graph and this regression (beta = 2.394418)
```{r, echo=TRUE}

ggplot(data = beer_ingredients_join, aes(total_hops, ibu)) +
  geom_point(aes(total_hops, ibu, colour = style_collapsed)) +
  geom_smooth(method = lm, se = FALSE, colour = "black") + 
  ggtitle("Hops Per Beer vs. Bitterness") +
  labs(x = "Number of Hops", y = "IBU", colour = "Style Collapsed") +
  theme_minimal()

hops_ibu_lm <- lm(ibu ~ total_hops, data = beer_ingredients_join)
summary(hops_ibu_lm)
```

* However, past a certain point (3 hops or more), there's no effect of number of hops on IBU
```{r, echo=TRUE}
ggplot(data = beer_ingredients_join[which(beer_ingredients_join$total_hops > 2
                                          & beer_ingredients_join$total_hops < 8), ], aes(total_hops, ibu)) +
  geom_point(aes(total_hops, ibu, colour = style_collapsed)) +
  geom_smooth(method = lm, se = FALSE, colour = "black") +
  ggtitle("3+ Hops Per Beer vs. Bitterness") +
  labs(x = "Number of Hops", y = "IBU", colour = "Style Collapsed") +
  theme_minimal()
```


**Most popular hops**

```{r, echo=TRUE}
# Gather up all the hops columns into one called `hop_name`
beer_necessities_hops_gathered <- beer_necessities %>%
  gather(
    hop_key, hop_name, hops_name_1:hops_name_13
  ) %>% as_tibble()

# Filter to just those beers that have at least one hop
beer_necessities_w_hops <- beer_necessities_hops_gathered %>% 
  filter(!is.na(hop_name)) %>% 
  filter(!hop_name == "")

beer_necessities_w_hops$hop_name <- factor(beer_necessities_w_hops$hop_name)

# For all hops, find the number of beers they're in as well as those beers' mean IBU and ABV
hops_beer_stats <- beer_necessities_w_hops %>% 
  ungroup() %>% 
  group_by(hop_name) %>% 
  summarise(
    mean_ibu = mean(ibu, na.rm = TRUE), 
    mean_abv = mean(abv, na.rm = TRUE),
    n = n()
  )

# Pare to hops that are used in at least 50 beers
pop_hops_beer_stats <- hops_beer_stats[hops_beer_stats$n > 50, ]
kable(pop_hops_beer_stats)

# Keep just beers that contain these most popular hops
beer_necessities_w_popular_hops <- beer_necessities_w_hops %>% 
  filter(hop_name %in% pop_hops_beer_stats$hop_name) %>% 
  droplevels() 
```

Are there certian hops that are used more often in very high IBU or ABV beers?
Hard to detect a pattern
```{r, echo = TRUE}
ggplot(data = beer_necessities_w_popular_hops) + 
  geom_point(aes(abv, ibu, colour = hop_name)) +
  ggtitle("Beers Containing most Popular Hops") +
  labs(x = "ABV", y = "IBU", colour = "Hop Name") +
  theme_minimal()
```

```{r, echo=TRUE}
ggplot(data = pop_hops_beer_stats) + 
  geom_point(aes(mean_abv, mean_ibu, colour = hop_name, size = n)) +
  ggtitle("Most Popular Hops' Effect on Alcohol and Bitterness") +
  labs(x = "Mean ABV per Hop Type", y = "Mean IBU per Hop Type", colour = "Hop Name", 
       size = "Number of Beers") +
  theme_minimal()
```


# Neural Net

* Can ABV, IBU, and SRM be used in a neural net to predict `style` or `style_collapsed`?
* In the function, specify the dataframe and the outcome, either `style` or `style_collapsed`; the one not specified as `outcome` will be dropped
* The predictor columns will be everything not specified in the vector `predictor_vars`
* The function returns the outcome variable sleected, neural net output, variable importance, the prediction dataframe, predictions, and accuracy

```{r, warning=FALSE, echo=TRUE, eval=TRUE, message=FALSE}

library(nnet)
library(caret)

run_neural_net <- function(df, outcome, predictor_vars) {
  out <- list(outcome = outcome)
  
  # Create a new column outcome; it's style_collapsed if you set outcome to style_collapsed, and style otherwise
  if (outcome == "style_collapsed") {
    df[["outcome"]] <- df[["style_collapsed"]]
  } else {
    df[["outcome"]] <- df[["style"]]
  }

  df$outcome <- factor(df$outcome)
  
  cols_to_keep <- c("outcome", predictor_vars)
  
  df <- df %>%
    select_(.dots = cols_to_keep) %>%
    mutate(row = 1:nrow(df)) %>% 
    droplevels()

  # Select 80% of the data for training
  df_train <- sample_n(df, nrow(df)*(0.8))
  
  # The rest is for testing
  df_test <- df %>%
    filter(! (row %in% df_train$row)) %>%
    select(-row)
  
  df_train <- df_train %>%
    select(-row)
  
  # Build multinomail neural net
  nn <- multinom(outcome ~ .,
                 data = df_train, maxit=500, trace=T)

  # Which variables are the most important in the neural net?
  most_important_vars <- varImp(nn)

  # How accurate is the model? Compare predictions to outcomes from test data
  nn_preds <- predict(nn, type="class", newdata = df_test)
  nn_accuracy <- postResample(df_test$outcome, nn_preds)

  out <- list(out, nn = nn, most_important_vars = most_important_vars,
              df_test = df_test,
              nn_preds = nn_preds,
           nn_accuracy = nn_accuracy)

  return(out)
}

```

* Set the dataframe to be `beer_totals`, the predictor variables to be the vector contained in `p_vars`, the outcome to be `style_collapsed`

```{r, echo=TRUE, eval=TRUE}
p_vars <- c("total_hops", "total_malt", "abv", "ibu", "srm")

nn_collapsed_out <- run_neural_net(df = beer_totals, outcome = "style_collapsed", 
                         predictor_vars = p_vars)


# How accurate was it?
nn_collapsed_out$nn_accuracy

# What were the most important variables?
nn_collapsed_out$most_important_vars

```


* What if we predcit `style` instead of `style_collapsed`?

```{r, echo=TRUE}

nn_notcollapsed_out <- run_neural_net(df = beer_totals, outcome = "style", 
                         predictor_vars = p_vars)

nn_notcollapsed_out$nn_accuracy

nn_notcollapsed_out$most_important_vars

```


And now if we add `glass` as a predictor?
```{r, echo=TRUE}

p_vars_add_glass <- c("total_hops", "total_malt", "abv", "ibu", "srm")

nn_collapsed_out_add_glass <- run_neural_net(df = beer_ingredients_join, outcome = "style_collapsed", 
                         predictor_vars = p_vars_no_glass)

nn_collapsed_out_add_glass$nn_accuracy

nn_collapsed_out_add_glass$most_important_vars

```




### Random forest with all ingredients

* We can use a random forest to get even more granular with ingredients
    * The sparse ingredient dataframe was too complex for the multinomial neural net; however, we can 

* Here we don't include `glass` as a predictor

```{r, echo=TRUE}

library(ranger)
library(stringr)

bi <- beer_ingredients_join %>% 
  select(-c(id, name, cluster_assignment, style, hops_name, malt_name,
            glass)) %>% 
  mutate(row = 1:nrow(.)) 

bi$style_collapsed <- factor(bi$style_collapsed)


# ranger complains about special characters and spaces in ingredient column names. Take them out and replace with empty string.
names(bi) <- tolower(names(bi))
names(bi) <- str_replace_all(names(bi), " ", "")
names(bi) <- str_replace_all(names(bi), "([\\(\\)-\\/')]+)", "")

# Keep 80% for training
bi_train <- sample_n(bi, nrow(bi)*(0.8))

# The rest is for testing
bi_test <- bi %>%
  filter(! (row %in% bi_train$row)) %>%
  dplyr::select(-row)

bi_train <- bi_train %>%
  dplyr::select(-row)


bi_rf <- ranger(style_collapsed ~ ., data = bi_train, importance = "impurity", seed = 11)
```


OOB (out of bag) prediction error is around 58%
    * This calculated from tree samples constructed but not used in training set; these trees become effectively part of test set
```{r}
bi_rf
```


We can compare predicted classification on the test set to their actual style classification.
```{r, echo=TRUE}
pred_bi_rf <- predict(bi_rf, dat = bi_test)
table(bi_test$style_collapsed, pred_bi_rf$predictions)
```


Variable importance
* Interestingly, ABV, IBU, and SRM are all much more important in the random forest than `total_hops` and `total_malt`
```{r, echo=TRUE}

importance(bi_rf)[1:10]
```


How does a CSRF (case-specific random forest) fare?

```{r, echo=TRUE}

bi_csrf <- csrf(style_collapsed ~ ., training_data = bi_train, test_data = bi_test,
                params1 = list(num.trees = 5, mtry = 4),
                params2 = list(num.trees = 2))

csrf_acc <- postResample(bi_csrf, bi_test$style_collapsed)

csrf_acc
```


![](./pour.jpg)



### Final Thoughts


*Style first, forgiveness later?*

* One reason  seems that beers are generally brewed with style in mind first ("let's make a pale ale") rather than deciding the beer's style after determining its characteristics and idiosyncrasies 
    * Even if the beer turns out more like a sour, and in a blind taste test might be classified as a sour more often than a pale ale, it still gets the label pale ale
    * This makes the style definitions broader and harder to predict



*Future Directions*

* Incorporate flavor profiles for beers sourced/scraped from somewhere
* Implement a GAN to come up with beer names
* More on the hops deep dive: which hops are used most often in which styles?





